## [1] "/Users/dcyoung23/Documents/Project P4 - Explore and Summarize Data"
Load data:
pl <- read.csv('prosperLoanData.csv')
113,937 observations of 81 variables in the dataset.
Isolate specific colums for further data exploration:
## [1] "Term"
## [2] "LoanStatus"
## [3] "BorrowerRate"
## [4] "EstimatedLoss"
## [5] "EstimatedReturn"
## [6] "ProsperRating..numeric."
## [7] "ProsperRating..Alpha."
## [8] "ProsperScore"
## [9] "ListingCategory..numeric."
## [10] "BorrowerState"
## [11] "Occupation"
## [12] "EmploymentStatusDuration"
## [13] "IsBorrowerHomeowner"
## [14] "CreditScoreRangeLower"
## [15] "OpenCreditLines"
## [16] "OpenRevolvingAccounts"
## [17] "OpenRevolvingMonthlyPayment"
## [18] "AmountDelinquent"
## [19] "DelinquenciesLast7Years"
## [20] "PublicRecordsLast10Years"
## [21] "RevolvingCreditBalance"
## [22] "BankcardUtilization"
## [23] "TradesNeverDelinquent..percentage."
## [24] "DebtToIncomeRatio"
## [25] "IncomeRange"
## [26] "StatedMonthlyIncome"
## [27] "LoanOriginalAmount"
## [28] "LoanOriginationDate"
## [29] "LoanOriginationQuarter"
## [30] "MonthlyLoanPayment"
## [31] "LP_InterestandFees"
## [32] "LP_GrossPrincipalLoss"
## [33] "LP_NetPrincipalLoss"
## [34] "LP_NonPrincipalRecoverypayments"
Summary statistics for specified columns:
## Term LoanStatus BorrowerRate
## Min. :12.00 Current :56576 Min. :0.0000
## 1st Qu.:36.00 Completed :38074 1st Qu.:0.1340
## Median :36.00 Chargedoff :11992 Median :0.1840
## Mean :40.83 Defaulted : 5018 Mean :0.1928
## 3rd Qu.:36.00 Past Due (1-15 days) : 806 3rd Qu.:0.2500
## Max. :60.00 Past Due (31-60 days): 363 Max. :0.4975
## (Other) : 1108
## EstimatedLoss EstimatedReturn ProsperRating..numeric.
## Min. :0.005 Min. :-0.183 Min. :1.000
## 1st Qu.:0.042 1st Qu.: 0.074 1st Qu.:3.000
## Median :0.072 Median : 0.092 Median :4.000
## Mean :0.080 Mean : 0.096 Mean :4.072
## 3rd Qu.:0.112 3rd Qu.: 0.117 3rd Qu.:5.000
## Max. :0.366 Max. : 0.284 Max. :7.000
## NA's :29084 NA's :29084 NA's :29084
## ProsperRating..Alpha. ProsperScore ListingCategory..numeric.
## :29084 Min. : 1.00 Min. : 0.000
## C :18345 1st Qu.: 4.00 1st Qu.: 1.000
## B :15581 Median : 6.00 Median : 1.000
## A :14551 Mean : 5.95 Mean : 2.774
## D :14274 3rd Qu.: 8.00 3rd Qu.: 3.000
## E : 9795 Max. :11.00 Max. :20.000
## (Other):12307 NA's :29084
## BorrowerState Occupation EmploymentStatusDuration
## CA :14717 Other :28617 Min. : 0.00
## TX : 6842 Professional :13628 1st Qu.: 26.00
## NY : 6729 Computer Programmer : 4478 Median : 67.00
## FL : 6720 Executive : 4311 Mean : 96.07
## IL : 5921 Teacher : 3759 3rd Qu.:137.00
## : 5515 Administrative Assistant: 3688 Max. :755.00
## (Other):67493 (Other) :55456 NA's :7625
## IsBorrowerHomeowner CreditScoreRangeLower OpenCreditLines
## False:56459 Min. : 0.0 Min. : 0.00
## True :57478 1st Qu.:660.0 1st Qu.: 6.00
## Median :680.0 Median : 9.00
## Mean :685.6 Mean : 9.26
## 3rd Qu.:720.0 3rd Qu.:12.00
## Max. :880.0 Max. :54.00
## NA's :591 NA's :7604
## OpenRevolvingAccounts OpenRevolvingMonthlyPayment AmountDelinquent
## Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 4.00 1st Qu.: 114.0 1st Qu.: 0.0
## Median : 6.00 Median : 271.0 Median : 0.0
## Mean : 6.97 Mean : 398.3 Mean : 984.5
## 3rd Qu.: 9.00 3rd Qu.: 525.0 3rd Qu.: 0.0
## Max. :51.00 Max. :14985.0 Max. :463881.0
## NA's :7622
## DelinquenciesLast7Years PublicRecordsLast10Years RevolvingCreditBalance
## Min. : 0.000 Min. : 0.0000 Min. : 0
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 3121
## Median : 0.000 Median : 0.0000 Median : 8549
## Mean : 4.155 Mean : 0.3126 Mean : 17599
## 3rd Qu.: 3.000 3rd Qu.: 0.0000 3rd Qu.: 19521
## Max. :99.000 Max. :38.0000 Max. :1435667
## NA's :990 NA's :697 NA's :7604
## BankcardUtilization TradesNeverDelinquent..percentage. DebtToIncomeRatio
## Min. :0.000 Min. :0.000 Min. : 0.000
## 1st Qu.:0.310 1st Qu.:0.820 1st Qu.: 0.140
## Median :0.600 Median :0.940 Median : 0.220
## Mean :0.561 Mean :0.886 Mean : 0.276
## 3rd Qu.:0.840 3rd Qu.:1.000 3rd Qu.: 0.320
## Max. :5.950 Max. :1.000 Max. :10.010
## NA's :7604 NA's :7544 NA's :8554
## IncomeRange StatedMonthlyIncome LoanOriginalAmount
## $25,000-49,999:32192 Min. : 0 Min. : 1000
## $50,000-74,999:31050 1st Qu.: 3200 1st Qu.: 4000
## $100,000+ :17337 Median : 4667 Median : 6500
## $75,000-99,999:16916 Mean : 5608 Mean : 8337
## Not displayed : 7741 3rd Qu.: 6825 3rd Qu.:12000
## $1-24,999 : 7274 Max. :1750003 Max. :35000
## (Other) : 1427
## LoanOriginationDate LoanOriginationQuarter MonthlyLoanPayment
## 2014-01-22 00:00:00: 491 Q4 2013:14450 Min. : 0.0
## 2013-11-13 00:00:00: 490 Q1 2014:12172 1st Qu.: 131.6
## 2014-02-19 00:00:00: 439 Q3 2013: 9180 Median : 217.7
## 2013-10-16 00:00:00: 434 Q2 2013: 7099 Mean : 272.5
## 2014-01-28 00:00:00: 339 Q3 2012: 5632 3rd Qu.: 371.6
## 2013-09-24 00:00:00: 316 Q2 2012: 5061 Max. :2251.5
## (Other) :111428 (Other):60343
## LP_InterestandFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. : -2.35 Min. : -94.2 Min. : -954.5
## 1st Qu.: 274.87 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 700.84 Median : 0.0 Median : 0.0
## Mean : 1077.54 Mean : 700.4 Mean : 681.4
## 3rd Qu.: 1458.54 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. :15617.03 Max. :25000.0 Max. :25000.0
##
## LP_NonPrincipalRecoverypayments
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 0.00
## Mean : 25.14
## 3rd Qu.: 0.00
## Max. :21117.90
##
Created new variables for loan origination month and loan origination year and a bar graph by month with a facet by year. The dataset ranges from November 2005 to March 2014. There is a gap in data between November 2008 to June 2009.
A google search of “Prosper loan November 2008” and the 1st hit is a TechCrunch article regarding the SEC shut down of peer-to-peer lender Prosper that stopped all lending.
The metadata makes references to several columns only available after July 2009. CreditGrade is applicable for listings pre-2009 and all of the estimated credit yields as well as what appears to be a new prosper rating/scoring system starting in July 2009. This is key information that will be considered in subsequent data exploration and creation of predictive models.
## Source: local data frame [3 x 5]
##
## Term LoanAmtVolume LoanAmtMean LoanAmtMedian LoanCnt
## 1 36 638686342 7276.155 5000 87778
## 2 60 303631410 12370.398 11500 24545
## 3 12 7576595 4694.297 3500 1614
Loan term options are 1, 3 and 5 years. 1 and 5 year loans were not made until 2011. 3 year loans remained the most popular choice.
Data is normally distributed but with a spike in loan volume around .31. Median of .1840 and Mean of .1928.
Data is normally distributed. Median of .092 and Mean of .096. Added x limits of 0 and .2 to remove long tail primarily for negative returns to provide a better visualization of the distribution.
Data closely resembles a plateau distribution with multiple peaks and valleys. Median of .072 and Mean of .080.
Data is normally distributed. Median of 4.00 and Mean of 4.072. The minimum and maximum rating is 1 and 7 respectively. From worst to best, the alpha scale ranges from HR to AA.
## Source: local data frame [11 x 5]
##
## ProsperScore LoanAmtMean LoanAmtMedian LoanAmtVolume LoanCnt
## 1 1 4570.955 4000 4534387 992
## 2 2 5279.778 4000 30443202 5766
## 3 3 7062.552 4500 53972021 7642
## 4 4 8401.920 7500 105822181 12595
## 5 5 8400.081 7000 82429995 9813
## 6 6 9222.604 8000 113235137 12278
## 7 7 10097.153 9500 106999534 10597
## 8 8 10487.978 10000 126411602 12053
## 9 9 10055.976 8300 69496847 6911
## 10 10 11742.895 10000 55778753 4750
## 11 11 14858.186 15000 21633519 1456
Data is normally distributed with similar peaks at 4, 6 and 8. The maximum score is 11 which is inconsistent with the metadata that indicates score ranges from 1 to 10, with 10 being the best.
Data is positively skewed for shorter employment lengths. Added a log10 and square root transformation arranged in a 1 column grid. Median of 67.00 and Mean of 96.07.
Data is normally distributed. Median of 680.0 and Mean of 685.6. Filtered out records with invalid credit score = 0 to remove left long tail. Used credit score range lower since rate qualification is typically based on the lower score in multi-credit scoring pricing model. Added spectral color pallette for Bad to Excellent credit score ranges.
Data is positively skewed. Added tick marks every 5000 since peaks show loan amounts appear to be more common in 5K increments (5K - 25K). Median of 6500.0 and Mean of 8337.0.
Data is positively skewed. Added x axis limit to exclude long tail for monthly payments greater than 99% quantile. Median of 217.7 and Mean of 272.5.
Top 10 loan volume by BorrowerState:
## Source: local data frame [51 x 5]
##
## BorrowerState LoanAmtVolume LoanAmtMean LoanAmtMedian LoanCnt
## 1 CA 132075153 8974.326 7000 14717
## 2 TX 62179088 9087.853 7500 6842
## 3 NY 59437488 8833.034 7000 6729
## 4 FL 55154135 8207.461 6500 6720
## 5 IL 49712307 8395.931 6500 5921
## 6 GA 41881214 8362.862 6000 5008
## 7 OH 33904448 8078.258 6500 4197
## 8 MI 27469230 7645.207 5250 3593
## 9 VA 29408372 8971.437 7500 3278
## 10 NJ 29511373 9529.019 8000 3097
## .. ... ... ... ... ...
Added map plot to show dominant CA market.
Loan volume by custom ListingCategoryGroup variable:
## Source: local data frame [9 x 5]
##
## ListingCategoryGroup LoanAmtVolume LoanAmtMean LoanAmtMedian LoanCnt
## 1 Debt Consolidation 577736197 9908.352 9500 58308
## 2 Not Available 106096621 6253.853 4500 16965
## 3 Other 69719962 6131.923 4000 11370
## 4 Household 71197236 7503.925 5000 9488
## 5 Business Use 64175191 8926.859 7279 7189
## 6 Personal Use 28095040 5502.358 4000 5106
## 7 Vehicle 15718287 5216.823 4000 3013
## 8 Medical 10447136 6476.836 4000 1613
## 9 Taxes 6708677 7580.426 5000 885
Debt consolidation loans are the most common loan type representing more than 50% of the total.
Summary of LoanStatus and new calculated LoanStatusBucket variable:
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
## Cancelled Closed Open
## 5 55084 58848
New variable PrincipalLoss flag:
##
## 0 1
## 97291 16646
The principal Loss variable will be used to identify loans that have defaulted and any principal amount was charged off.
New variable BorHomeowner flag:
##
## 0 1
## 56459 57478
Converted True/False text field.
What is the structure of your dataset?
There are 113,937 loans in the dataset with 81 total columns. Based on existing domain knowledge of the lending industry, I immediately isolated specific columns for further analysis. However, as I conducted the single variable analysis I went back and made revisions to my column list. For example, once I made the determination that the credit rating system changed in July 2009 I added each of the Prosper Rating/Score variables and removed Credit Grade.
Loan origination volume range from November 2005 to March 2014. There is no data between November 2008 to June 2009 due to the SEC shut down of Prosper. Loan term options are 1, 3 and 5 years. Prosper did not make 1 and 5 year loans until 2011 but the 3 year loan remained the most popular choice.
Prosper has an ordered factor variable for credit rating as well as an accompanying numeric variable and custom risk score.
Worst -> Best
Prosper rating alpha: HR, E, D, C, B, A, AA
Prosper rating numeric: 1 - 7
Prosper score: 1 - 11
All rating/risk score fields are NA pre-2009 so the datset will be filtered accordingly. Based on the variable definitions the maximum risk score was expected to be 10.
Median rate is 18.4% with spike in volume at 31%.
Median estimated return and loss is 9.2% and 7.2% respectively.
Median employment duration is 67 months.
Median credit score lower is 680. Median loan amount is $6500.
CA has the highest volume of loans and Debt Consolidation is the top listing category.
What is/are the main feature(s) of interest in your dataset?
The main features of interest in the dataset rate, loan amount and prosper rating My perspective is from an investor point of view and investigating the relationship of customer profiles and probability of the loan defaulting and loss of principal.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
The features that will be explored further are:
Did you create any new variables from existing variables in the dataset?
The dataset includes estimated return and loss rates but these are assigned at the time the loan listing was created. The dataset does not have average daily balance data to calculate actual rates so net principal loss was used to calculate a principal loss flag.
In addition, new variables were created for loan origination month and year, credit score category (Bad, Poor, Fair, Good, Excellent), listing category group and loan status bucket (Cancelled, Open, Closed).
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
The positive skew of employment status duration was transformed using log 10 and standard deviation. Due to the gap in data between November 2008 to June 2009 and introduction of the prosper rating and risk score metrics, the dataset will be filtered for loans >= 2009 for analysis on these fields.
Rate has an unusual spike in loan counts at 31% with median values of 18.40% and Mean of 19.28%. Estimated loss has what closely resembles a plateau distribution with multiple peaks at similar heights.
Prosper rating and prosper score are normally distributed but the prosper score has unusual peaks at 4, 6 and 8. This will be explored further but will focus on prosper rating in subsequent analysis.
Correlation matrix 1 for rate, credit rating and key loan fields:
## BorRate EstLoss EstRet ProspRate
## BorRate 1.0000000000 0.94529248 0.81767854 -0.95310541
## EstLoss 0.9452924831 1.00000000 0.59105633 -0.96418058
## EstRet 0.8176785371 0.59105633 1.00000000 -0.65998874
## ProspRate -0.9531054149 -0.96418058 -0.65998874 1.00000000
## ProspScore -0.6497455552 -0.67371541 -0.38326305 0.70522276
## CreditScore -0.5086563005 -0.51123953 -0.34620303 0.54884709
## Term -0.0000762274 -0.10712738 0.15258177 0.07915118
## EmpDur -0.0391820185 -0.03916047 -0.03648651 0.03607306
## Income -0.0934771070 -0.08925538 -0.07500608 0.09430718
## LoanAmt -0.4135014440 -0.42995566 -0.28608272 0.42861533
## BorHomeowner -0.1261559744 -0.12712601 -0.08613725 0.13645771
## PrincipalLoss 0.2387775340 0.21135582 0.25191666 -0.19520939
## ProspScore CreditScore Term EmpDur
## BorRate -0.649745555 -0.50865630 -0.0000762274 -0.039182018
## EstLoss -0.673715407 -0.51123953 -0.1071273760 -0.039160473
## EstRet -0.383263051 -0.34620303 0.1525817714 -0.036486506
## ProspRate 0.705222758 0.54884709 0.0791511793 0.036073061
## ProspScore 1.000000000 0.36960569 0.0289477016 -0.007302109
## CreditScore 0.369605692 1.00000000 0.0502557933 0.029313163
## Term 0.028947702 0.05025579 1.0000000000 0.052555496
## EmpDur -0.007302109 0.02931316 0.0525554963 1.000000000
## Income 0.083777108 0.06770702 0.0092381215 0.051380442
## LoanAmt 0.266386099 0.27786723 0.3390362058 0.078216651
## BorHomeowner 0.064430345 0.27692463 0.0760532815 0.155986596
## PrincipalLoss -0.062738593 -0.06963978 -0.0459869079 -0.036297552
## Income LoanAmt BorHomeowner PrincipalLoss
## BorRate -0.093477107 -0.41350144 -0.12615597 0.23877753
## EstLoss -0.089255381 -0.42995566 -0.12712601 0.21135582
## EstRet -0.075006076 -0.28608272 -0.08613725 0.25191666
## ProspRate 0.094307176 0.42861533 0.13645771 -0.19520939
## ProspScore 0.083777108 0.26638610 0.06443034 -0.06273859
## CreditScore 0.067707020 0.27786723 0.27692463 -0.06963978
## Term 0.009238121 0.33903621 0.07605328 -0.04598691
## EmpDur 0.051380442 0.07821665 0.15598660 -0.03629755
## Income 1.000000000 0.18283792 0.12027047 -0.04688319
## LoanAmt 0.182837920 1.00000000 0.17783717 -0.12620284
## BorHomeowner 0.120270472 0.17783717 1.00000000 -0.03704747
## PrincipalLoss -0.046883194 -0.12620284 -0.03704747 1.00000000
In the top left of the correlation matrix, rate, estimated loss, estimated return, prosper rating, prosper score and credit score all have high correlation and will be explored further. Loan amount and principal loss are the additional fields of interest.
Correlation matrix 2 for rate, credit rating and key credit reporting fields:
## BorRate ProspRate ProspScore CreditScore RevPmt
## BorRate 1.00000000 -0.95346747 -0.65657254 -0.52854934 -0.06348382
## ProspRate -0.95346747 1.00000000 0.71059189 0.56864137 0.05193759
## ProspScore -0.65657254 0.71059189 1.00000000 0.38700058 0.01799715
## CreditScore -0.52854934 0.56864137 0.38700058 1.00000000 0.05658146
## RevPmt -0.06348382 0.05193759 0.01799715 0.05658146 1.00000000
## RevBal -0.06298067 0.05991414 0.05360869 0.05534997 0.73248926
## CardUtil 0.25520817 -0.28093930 -0.25744548 -0.44309234 0.30342525
## DTI 0.12642797 -0.13534359 -0.14533589 -0.01370880 0.12308564
## AmtDlq 0.05485593 -0.05344622 -0.04223792 -0.05137617 -0.05162808
## TotalTrades -0.05316765 0.04959010 -0.01792112 0.09049613 0.40908969
## Dlq7 0.15018934 -0.15500180 -0.10399108 -0.22023679 -0.19232792
## PublicRecord10 0.11825403 -0.12504463 -0.08678053 -0.22029099 -0.18770705
## LoanAmt -0.40541402 0.41975417 0.26446056 0.28580939 0.24613633
## PrincipalLoss 0.22883249 -0.18520392 -0.05485596 -0.07422851 -0.06615066
## RevBal CardUtil DTI AmtDlq
## BorRate -0.06298067 0.255208165 0.12642797 0.05485593
## ProspRate 0.05991414 -0.280939297 -0.13534359 -0.05344622
## ProspScore 0.05360869 -0.257445475 -0.14533589 -0.04223792
## CreditScore 0.05534997 -0.443092336 -0.01370880 -0.05137617
## RevPmt 0.73248926 0.303425250 0.12308564 -0.05162808
## RevBal 1.00000000 0.262865719 0.04485603 -0.02471830
## CardUtil 0.26286572 1.000000000 0.05744962 -0.02237036
## DTI 0.04485603 0.057449615 1.00000000 -0.02694389
## AmtDlq -0.02471830 -0.022370358 -0.02694389 1.00000000
## TotalTrades 0.26572616 0.070371237 0.07937152 0.03118406
## Dlq7 -0.12852593 -0.044270080 -0.06634706 0.22074222
## PublicRecord10 -0.14576905 -0.001275848 -0.04997642 0.03996097
## LoanAmt 0.18488691 -0.031467564 -0.01783746 -0.03381422
## PrincipalLoss -0.03966827 -0.036125729 0.03149852 0.01147444
## TotalTrades Dlq7 PublicRecord10 LoanAmt
## BorRate -0.05316765 0.150189336 0.118254033 -0.40541402
## ProspRate 0.04959010 -0.155001803 -0.125044632 0.41975417
## ProspScore -0.01792112 -0.103991082 -0.086780531 0.26446056
## CreditScore 0.09049613 -0.220236790 -0.220290987 0.28580939
## RevPmt 0.40908969 -0.192327918 -0.187707048 0.24613633
## RevBal 0.26572616 -0.128525932 -0.145769051 0.18488691
## CardUtil 0.07037124 -0.044270080 -0.001275848 -0.03146756
## DTI 0.07937152 -0.066347055 -0.049976416 -0.01783746
## AmtDlq 0.03118406 0.220742216 0.039960974 -0.03381422
## TotalTrades 1.00000000 0.109393082 -0.027460673 0.15630683
## Dlq7 0.10939308 1.000000000 0.243622496 -0.11199231
## PublicRecord10 -0.02746067 0.243622496 1.000000000 -0.10524710
## LoanAmt 0.15630683 -0.111992312 -0.105247095 1.00000000
## PrincipalLoss -0.05948305 0.009173272 0.017404246 -0.12288087
## PrincipalLoss
## BorRate 0.228832488
## ProspRate -0.185203923
## ProspScore -0.054855959
## CreditScore -0.074228510
## RevPmt -0.066150663
## RevBal -0.039668271
## CardUtil -0.036125729
## DTI 0.031498522
## AmtDlq 0.011474438
## TotalTrades -0.059483047
## Dlq7 0.009173272
## PublicRecord10 0.017404246
## LoanAmt -0.122880873
## PrincipalLoss 1.000000000
Due to number variables in the dataset, this correlation matrix now explores the various credit reporting fields and relationship with rate, prosper rating, prosper score and credit score. Card utilization and credit score as well as total trades/revolving balance and revolving payment have a good relationship but no other variables stand out for further exploration.
It was assumed that most of these credit reporting fields would have some factor in the credit score so it is interesting to see that card utilization has the highest correlation of all credit related fields.
3 year loans are the most common across all loan amounts.
I will now explore rate and estimated loss, estimated return, prosper rating, credit score, loan amount and income.
As rate increases, estimated loss for the loan increases. At rates at 20% and higher there are vertical bands for higher loss estimates. This will be explored in the multivariate analysis section to identify the types of loans that represent these higher loss estimates.
As rate increases, estimated return for the loan increases. At rates at 20% and higher there are vertical bands for lower return estimates. This is consistent since as the estimated loss increases, estimated return decreases. This will be explored in the multivariate analysis section to identify the types of loans that represent these lower return estimates. It is noted some loans have estimated negative returns.
Prosper rating scale is a discrete range from 1 to 7 so you get overplotting on the vertical bands for each rating. There is a negative linear relationship with higher amount of outliers for lower ratings. This plot is not very effective due to discrete nature of the rating scale.
The boxplot is a better plot to visualize the data for rating and rate due to the discrete rating scale.
The credit score range lower variable is represented in increments of 20 so you get overplotting on the vertical bands for each score. Although there is a negative non-linear relationship between credit score and rate, the data really shows that credit score most likely is not the sole factor in the customers rate.
Using the custom credit score category field it buckets the credit scores to get a more natural visual instead of the vertical bands in the scatterplot. The boxplot wiskers also show the wide ranging rates for each credit score category.
There is a non-linear relationship between loan amount and rate. As loan amount increases, the density and range of rates at the top end decreases.
Summary data for Income Range:
## Source: local data frame [8 x 5]
##
## IncomeRange LoanAmtMean LoanAmtMedian LoanAmtVolume LoanCnt
## 1 Not displayed 5169.649 3033 40018253 7741
## 2 $1-24,999 4273.974 4000 31088885 7274
## 3 Not employed 4884.829 4000 3937172 806
## 4 $0 7410.931 5000 4602188 621
## 5 $25,000-49,999 6177.987 5000 198881762 32192
## 6 $50,000-74,999 8675.276 7500 269367313 31050
## 7 $75,000-99,999 10365.924 9700 175349966 16916
## 8 $100,000+ 13073.127 12000 226648808 17337
Median rate decreases slightly as income range increases. Income level is unknown for Not displayed so unable to determine where it would truly fall in the range. There does not appear to be a significant relationship between income level and rate.
I’ll take a look at the relationship of income and loan amount now.
Median loan amount increases as income range increases. Excluded outliers $0 and Not employed from box plots. Stated monthly income Median of 4667 and Mean of 5608.
Next I will explore prosper rating and credit score and loan amount.
Median credit score is higher for the lowest 1 - HR credit rating than 2 - E. This suggests some other credit factor could be the determining factor between these ratings. Median credit score does increase as credit rating increases from credit rating 2 - E to 7 - AA.
## Source: local data frame [7 x 6]
##
## ProsperRating..numeric. LoanAmtMean LoanAmtMedian LoanAmtMax
## 1 1 3463.114 4000 16800
## 2 2 4586.405 4000 15900
## 3 3 7083.439 6100 15000
## 4 4 10391.940 10000 25000
## 5 5 11622.355 10000 35000
## 6 6 11459.886 10000 35000
## 7 7 11583.539 10940 35000
## Variables not shown: LoanAmtVolume (dbl), LoanCnt (int)
Similar to income range, median loan amount increases for higher credit rating. The boxplot shows the maximum loan amounts by credit rating. For credit ratings >= 5 max loan amount is 35,0000, Credit rating = 4 max loan amount is 25,000 and <= 3 betwee 15,000 - 17,000. Loan amount caps can minimize credit loss exposure in the event of default. I’m really interested in exploring the benefits of loan amount limits based on the borrowers credit rating.
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
Rate has a strong positive linear relationship with estimated loss and estimated return with a correlation coefficient of .95 and .82 respectively. It has a strong negative linear relationship with prosper rating at -.95. The relationship with the custom risk prosper score was not as strong as expected at -.65 and confirmed some suspicion about the distribution of this field in the univariate analysis section. It was a little surprising the relationship with credit score was not as strong at -.51.
Loans with the highest stated monthly income have the highest median loan amount and lowest median rate. This is consistent with expectations that people that make more money can afford larger loan payments. Although the disparity in rates is not large it also shows the more money you make the better rate you will get. Is this a product of making more money or better credit ratings?
Loans with the highest credit rating have the highest median loan amount and lowest median rate. What would be the driver for lower loan amounts for lower credit ratings? The disparity in rates is much larger for credit rating.
Loan amount has the strongest relationship with prosper rating at .43 and is the only field that had any significant relationship with the Term of the loan at .34.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
In the other features, I really was expecting stronger relationships with the selected credit reporting fields and credit score. Card utilization was the strongest at -.44. delinquency last 7 and public record last 10 were the next at -.22 for both. This really moved my focus to the prosper rating field since it inherently represented the credit worthiness of the borrower.
What was the strongest relationship you found?
The strongest relationship was prosper rating and rate and estimated loss at -.95 and -.96 respectively. Due to limitations in the dataset, I could not calculate actual loss rates so I focused on defaulted loans that had any amount of principal amount charged off.
I will now revisit the scaterplots in the Bivariate Plots section and add color by prosper rating.
The vertical bands are clearly associated with higher loss estimates for the worst HR credit rating.
Inverting the plot now for return estimates shows the same relationship. The really interesting part about this plot is some HR credit rating loans have negative estimated returns.
Adding prosper rating to the credit score and rate scatterplot really highlights the relationship of the credit rating system and interest rates across ranges of credit scores.
This plot is another visual that highlights at lower credit ratings loan amounts have maximum limits. The plot shows only ratings >=5 have loan amounts up to the maximum of 35,000. I’m interested in exploring this further in a predictive model to identify how loan amount limits for each credit rating can minimize the probability of loss.
Rate and monthly income have a weak relationship but adding the credit rating tells the story about rates across similar income levels.
I will now explore relationships with the calculated variable principal loss. I have narrowed down my variables of interest to borrower rate, prosper rating and loan amount.
Prosper rating histogram by loans with a principal loss. Filtered for all loans originated >= 2009 and closed. Including loans that are still open would skew the analysis.
The rate spike at .31 is predominantly HR credit rating loans.
The density of loans with a principal loss is much higher with rates >= .225.
Added fill for prosper rating to the density plot for borrower rate and principal loss. This plot is a great visual to show the relationship ofrating and rate. The density curves are distinct along the axis for each rating. It is interesting to see the spike for no principal loss loans at .35 but a significant spike for 1 - HR credit rating at .31.
Loans with lower credit ratings 1 - 3 have a higher distribution in the loan amount <= 5000 bins.
The density of loans with a principal loss is much higher with loan amounts between 2500 - 5000 and then follows a very similar curve as loan amount approaches 35000.
Added fill for prosper rating to the density plot for loan amount and principal loss. The spike around 5K for 1 - HR credit rating loans with a principal loss really stands out in this plot.
##
## Call:
## glm(formula = PrincipalLoss ~ ProsperRating..numeric. + LoanOriginalAmount,
## family = "binomial", data = myData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2315 -0.8205 -0.6077 -0.2586 2.6404
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.929e-01 3.611e-02 -19.191 < 2e-16 ***
## ProsperRating..numeric.2 -1.923e-01 4.830e-02 -3.981 6.87e-05 ***
## ProsperRating..numeric.3 -5.642e-01 4.601e-02 -12.263 < 2e-16 ***
## ProsperRating..numeric.4 -1.000e+00 5.522e-02 -18.110 < 2e-16 ***
## ProsperRating..numeric.5 -1.319e+00 6.112e-02 -21.575 < 2e-16 ***
## ProsperRating..numeric.6 -1.829e+00 6.709e-02 -27.266 < 2e-16 ***
## ProsperRating..numeric.7 -2.811e+00 1.221e-01 -23.027 < 2e-16 ***
## LoanOriginalAmount 4.877e-05 3.615e-06 13.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28508 on 26004 degrees of freedom
## Residual deviance: 26768 on 25997 degrees of freedom
## (144 observations deleted due to missingness)
## AIC: 26784
##
## Number of Fisher Scoring iterations: 5
## OR 2.5 % 97.5 %
## (Intercept) 0.50011510 0.46586568 0.53670266
## ProsperRating..numeric.2 0.82508646 0.75053635 0.90698774
## ProsperRating..numeric.3 0.56883903 0.51978321 0.62251279
## ProsperRating..numeric.4 0.36785391 0.33001927 0.40978487
## ProsperRating..numeric.5 0.26747524 0.23715001 0.30136256
## ProsperRating..numeric.6 0.16052769 0.14060745 0.18291358
## ProsperRating..numeric.7 0.06017423 0.04703614 0.07594495
## LoanOriginalAmount 1.00004877 1.00004167 1.00005584
## LoanOriginalAmount ProsperRating..numeric. Pred
## 1 4500 1 0.38379544
## 2 4500 2 0.33945219
## 3 4500 3 0.26160811
## 4 4500 4 0.18640534
## 5 4500 5 0.14280350
## 6 4500 6 0.09089479
## 7 4500 7 0.03612486
The line plot with the confidence interval ribbon shows how the predicted probability for loss increases as loan amount increases and credit rating decreases. In addition the confidence interval gets wider.
See the multivariate and final plots section for additional comments on this model.
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
I first expanded the scatterplots from the bivariate section to include prosper rating. estimated loss and estimated return had a clear correlation with credit rating. As credit rating decreases, estimated loss increases and return decreases due to their direct relationship (return is the difference between effective yield and loss). In the rate and loan amount by prosper rating scatterplot, all credit rating levels have loan amounts between 0 and 10000 but the lower the credit rating the higher the rate. Only the top 4 credit ratings have loan amounts >= 25000.
For the calculated column principal loss, the strongest relationship was with rate, estimated loss, estimated return, prosper rating and loan amount. The density plot for rate and principal loss shows a higher density for rates >= .225. I added a fill by credit rating and this really popped with the curves higher for no principal loss and credit rating 5 - 7 and spikes significantly at a rate of .31 for credit rating 1.
Were there any interesting or surprising interactions between features?
Credit score and credit rating relationship was interesting in that I expected a more distinct line between rating and the credit score ranges. Although median credit scores are higher for higher credit ratings, the scatterplot showed a lot of overlap in scores across ratings.
Income amount was really all over the place with no clear relationship but there is the appearance of a higher representation of higher income amounts for higher credit ratings.
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
I created a logistic regression model for principal loss as the dependent variable and independent variables prosper Rating and loan amount. I did not include borrower rate as well since it has such strong negative linear relationship with prosper rating.
I initially used a random sample of 500 loans to get a better sense of the true p values for the variables. The first dataset that I ran through the model was for the median loan amount of 4500 for all 7 credit ratings. The probability of a loan defaulting and any amount of principal being charged off is 38.38% and 3.61% for credit ratings 1 - HR and 7 - AA respectively.
The second dataset was simulated for loan amounts from 1000 to 35000 to build a 95% confidence interval plot for the predicted probabilities.
Plot 1
Description 1
This boxplot provides the best visual for median loan amounts for each credit rating and the whiskers of the boxplots show the maximum loan amounts.
Plot 2
Description 2
Initially I planned on calculating actual return and loss rates but due to the absence of daily average balance data you really cannot calculate actual rates. You can calculate simple rates based on the Interest and Fees and Non-Principal Recovery Payments variables and Loan Amount. However, I decided to focus my analysis on the derived binary variable for Principal Loss. These plots provide a great visual of the linear relationship between Borrower Rate and Estimated Loss/Return with the outliers for the 1 - HR credit rating.
I did some customization of the final plot to arrange the plots together with only 1 x axis label and legend guide positioned at the bottom.
Plot 3
Description 3
The line plot with the prediction probability confidence interval ribbon for the logistic regression model was the culmination of my analysis. This model could be the basis for establishing loan amount limits for each credit rating. From an investor perspective, listings could be run through the model to identify the probability of loss of principal to diversify a portfolio of peer to peer loan investments.
The Prosper loan dataset is pretty large with 113,937 loans with 88 variables ranging in loan originations from 2005 - 2014. Based on the number of variables my first step was to review the variable definitions and header data to reduce the columns down to a more manageable list. I then discovered the gap in data between November 2008 - June 2009 due to the SEC shutdown. This was an important observation since it was critical for my subsequent data subsets since the credit rating system changed after they reopened in 2009. Once I explored the estimated loss and return variables my immediate thought was to calculate actual loss and return rates for closed loans. However, I decided not to go down that path since the dataset only had original loan amount and lacked time series daily average loan balance data. I then focused my attention on exploring the variables that had the strongest relationship to rate and loan amount and ultimately which loans defaulted and had any amount of principal charged off. I eliminated fields such as term, monthly income and employment duration since they did not have strong relationships with my features of interest. I really honed in on the prosper credit rating in the bivariate section based on the output of each correlation matrix. At that point, I really established my direction and brought it all together in the multivariate section and creation of the predictive model.
Initially I was not creating values for each plot and during the course of my analysis I struggled a little keeping everything straight on what plots I had already created. By using a specific naming convention this really helped keep my project organized and I could review what values I had in the environment already.
My plan is to continue on with this analysis but on Prospers main competitor Lending Club to see if there is any measurable performance differences between the two peer to peer lenders.